Information content versus word length in random typing



Ramon Ferrer-i-Cancho 1,* and Fermin Moscoso del Prado Martin 2,3,4

1 Complexity & Quantitative Linguistics Lab 
Departament de Llenguatges i Sistemes Informatics,

TALP Research Center, Universitat Politecnica de Catalunya, 
Campus Nord, Edifici Omega Jordi Girona Salgado 1-3. 
08034 Barcelona, Catalonia (Spain) 

2 Laboratoire de Psychologie Cognitive (UMR6146) 
CNRS & Aix-Marseille Universite I, Marseille, France 

3 Laboratoire Dynamique du Langage (UMR5596) 
CNRS & Universite de Lyon II, Lyon, France 

4 Institut Rhone-Alpin de Systemes Complexes, Lyon, France

E-mail: rferrericancho@lsi.upc.edu and fermin.moscoso-del-prado@gmail.com

Abstract. Recently, it has been claimed that a linear relationship between a measure 
of information content and word length is expected from word length optimization 
and it has been shown that this linearity is supported by a strong correlation between 
information content and word length in many languages (Piantadosi et al. 2011, PNAS 
108, 3825-3826). Here, we study in detail some connections between this measure and 
standard information theory. The relationship between the measure and word length is 
studied for the popular random typing process where a text is constructed by pressing 
keys at random from a keyboard containing letters and a space behaving as a word 
delimiter. Although this random process does not optimize word lengths according to 
information content, it exhibits a linear relationship between information content and 
word length. The exact slope and intercept are presented for three major variants of 
the random typing process. A strong correlation between information content and word 
length can simply arise from the units making a word (e.g., letters) and not necessarily 
from the interplay between a word and its context as proposed by Piantadosi et al. In 
itself, the linear relation does not entail the results of any optimization process.

1. Introduction

In his pioneering research, G. K. Zipf showed that more frequent words tend to be
shorter [1], and parallels of this brevity law have been reported for the behavior of
other species [2, 3]. Recently, it has been argued that "average information content
is a much better predictor of word length than frequency" and that this "indicates
that human lexicons are efficiently structured for communication by taking into account
interword statistical dependencies" [4, p. 1]. According to the uniform information
density hypothesis (e.g., [5]), "language users make choices that keep the number of bits
of information communicated per unit of time approximately constant" and thus "the
amount of information conveyed by a word should be linearly related to the amount
of time it takes to produce -approximately, its length- to convey the same amount of
information in each unit of time" [4, p. 1]. Here it will be shown that hitting keys from a
keyboard at random (e.g., [6, 7]) generates words that reproduce this linear relationship.
Therefore, the observation of such a linear relationship does not constitute unequivocal
evidence for any kind of optimal choices made by speakers.

Throughout this paper, $C$ denotes contexts and $W$ denotes words. As in Ref. [4],
the context of a word consists of a fixed number of preceding words, and the information
content of a word $w$ is given by

$$I(w) = -\sum_{c} p(C = c \mid W = w) \ln p(W = w \mid C = c).$$

The expected information content of words of length $\ell$ is defined as [4]

$$I(\ell) = \sum_{\|w\| = \ell} p(W = w \mid \|w\| = \ell)\, I(w), \tag{1}$$

where $\|w\|$ is the length (in letters) of a word $w$ and $\ell$ is a fixed parameter value. In
this study, we detail some connections between $I(w)$ and standard information theory
measures. The definition of $I(w)$ that we borrow from Ref. [4] is somewhat idiosyncratic
in relation to standard information theory. We found that Ref. [8], the reference
supplied in Ref. [4] as a justification for Eq. (1), does not in fact justify the equation
in any evident way. In this study we demonstrate that $I(\ell)$ is a linear function of $\ell$ for
a general class of random typing processes. The only requirement is that the context
is defined by means of neighbouring words (as in [4]) or that empty words (words of
length zero) are allowed, as in many variants of the random typing process [6, 9, 10].
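The two definitions above can be made concrete with a minimal Python sketch. It assumes the joint distribution $p(C = c, W = w)$ is available as a dictionary; the function names and the toy probabilities below are hypothetical and serve only as an illustration, not as the procedure used in Ref. [4].

```python
import math
from collections import defaultdict

def information_content(joint):
    """joint maps (context, word) -> p(C=c, W=w). Returns (I(w) per word, p(W=w))."""
    p_w = defaultdict(float)  # marginal p(W = w)
    p_c = defaultdict(float)  # marginal p(C = c)
    for (c, w), p in joint.items():
        p_w[w] += p
        p_c[c] += p
    info = defaultdict(float)
    for (c, w), p in joint.items():
        if p > 0:
            # accumulate -p(C=c | W=w) * ln p(W=w | C=c)
            info[w] -= (p / p_w[w]) * math.log(p / p_c[c])
    return dict(info), dict(p_w)

def expected_information_by_length(info, p_w):
    """I(l): average of I(w) over words of length l, weighted by p(W=w | |w|=l)."""
    p_len = defaultdict(float)
    acc = defaultdict(float)
    for w, p in p_w.items():
        p_len[len(w)] += p
        acc[len(w)] += p * info[w]
    return {l: acc[l] / p_len[l] for l in p_len}

# Hypothetical toy joint distribution (values invented for illustration).
toy_joint = {
    ("the", "cat"): 0.30, ("the", "dog"): 0.20,
    ("a", "cat"): 0.10, ("a", "horse"): 0.40,
}
info, p_w = information_content(toy_joint)
by_length = expected_information_by_length(info, p_w)
```

In this toy case "horse" only occurs after "a", so its information content reduces to $-\ln p(W = \text{horse} \mid C = \text{a}) = \ln 1.25$.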

2. Connections with standard information theory 

We now introduce our basic notation and conventions. The self-information of an
event that has probability $p$ is $-\ln p$. We consider $C$ and $W$ independent if and
only if $p(C = c, W = w) = p(C = c)\,p(W = w)$. As usual, by the definition of
conditional probability, independence implies both $p(C = c \mid W = w) = p(C = c)$
and $p(W = w \mid C = c) = p(W = w)$, for any individual $c$ and $w$. Therefore, under
independence between $C$ and $W$, it holds that $I(w) = I_0(w) = -\ln p(W = w)$, that is
to say, $I(w)$ is just the self-information of $w$. The expected self-information content of
a word of length $\ell$ is

$$\begin{aligned}
I_0(\ell) &= -\sum_{\|w\| = \ell} p(W = w \mid \|w\| = \ell) \ln p(W = w) \\
&= -\sum_{\|w\| = \ell} p(W = w \mid \|w\| = \ell) \ln p(W = w, \|w\| = \ell).
\end{aligned} \tag{2}$$

In sum, under independence between $C$ and $W$, $I(\ell)$ and $I_0(\ell)$ coincide.
The conditional entropy is defined as

$$\begin{aligned}
H(W \mid C) &= \sum_{c} p(C = c)\, H(W \mid C = c) \\
&= -\sum_{c} p(C = c) \sum_{w} p(W = w \mid C = c) \ln p(W = w \mid C = c).
\end{aligned} \tag{3}$$

Given only the joint probability, i.e. $p(W = w, C = c)$, one can use Bayes' theorem
for calculating the conditional and marginal probabilities, as was done in previous
work [4] and is assumed by various information theoretic models of Zipf's law for word
frequencies [11, 12]. Simple application of Bayes' theorem to the definition of $H(W \mid C)$
in (3) shows that the conditional entropy is the expectation of $I(w)$:

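$$\begin{aligned}
H(W \mid C) &= -\sum_{c} \sum_{w} p(W = w, C = c) \ln p(W = w \mid C = c) \\
&= -\sum_{w} p(W = w) \sum_{c} \frac{p(W = w, C = c)}{p(W = w)} \ln p(W = w \mid C = c) \\
&= -\sum_{w} p(W = w) \sum_{c} p(C = c \mid W = w) \ln p(W = w \mid C = c) \\
&= \sum_{w} p(W = w)\, I(w) = E[I(w)].
\end{aligned} \tag{4}$$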
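The identity between the conditional entropy and the expectation of $I(w)$ can be checked numerically. The following sketch assumes an arbitrary, strictly positive joint distribution drawn at random; the matrix size and the seed are illustrative choices, not values taken from the paper.

```python
import math
import random

def conditional_entropy_two_ways(joint):
    """joint[c][w] = p(C=c, W=w). Returns (H(W|C) by definition, sum_w p(w) I(w))."""
    n_c, n_w = len(joint), len(joint[0])
    p_c = [sum(joint[c]) for c in range(n_c)]
    p_w = [sum(joint[c][w] for c in range(n_c)) for w in range(n_w)]
    # Definition: -sum_c p(c) sum_w p(w|c) ln p(w|c)
    h_direct = -sum(
        p_c[c] * sum((joint[c][w] / p_c[c]) * math.log(joint[c][w] / p_c[c])
                     for w in range(n_w))
        for c in range(n_c)
    )
    # I(w) = -sum_c p(c|w) ln p(w|c), followed by its expectation over words
    info = [
        -sum((joint[c][w] / p_w[w]) * math.log(joint[c][w] / p_c[c])
             for c in range(n_c))
        for w in range(n_w)
    ]
    h_expect = sum(p_w[w] * info[w] for w in range(n_w))
    return h_direct, h_expect

rng = random.Random(0)  # fixed seed, arbitrary choice
raw = [[rng.random() + 0.01 for _ in range(4)] for _ in range(3)]
total = sum(map(sum, raw))
joint = [[x / total for x in row] for row in raw]  # 3 contexts, 4 words
h_direct, h_expect = conditional_entropy_two_ways(joint)
```

Both returned values should agree up to floating-point error, whatever positive joint distribution is supplied.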

It is not difficult to see that $I_0(w)$ is the upper bound of $I(w)$ and $H(C \mid w)$ is its
lower bound; formally,

$$H(C \mid w) \le I(w) \le I_0(w). \tag{5}$$

As for the lower bound of $I(w)$, the relative entropy (or Kullback-Leibler divergence)
between the context conditional probability and the word conditional probability is

$$\begin{aligned}
D\big(p(C = c \mid W = w) \,\|\, p(W = w \mid C = c)\big)
&= \sum_{c} p(C = c \mid W = w) \ln \frac{p(C = c \mid W = w)}{p(W = w \mid C = c)} \\
&= \sum_{c} p(C = c \mid W = w) \ln p(C = c \mid W = w) \\
&\quad - \sum_{c} p(C = c \mid W = w) \ln p(W = w \mid C = c) \\
&= I(w) - H(C \mid w).
\end{aligned}$$

Therefore $I(w) \ge H(C \mid w)$ by the non-negativity of the relative entropy [13]. As for
the upper bound of $I(w)$, the non-negativity of mutual information, i.e. $I(W;C) =
H(W) - H(W \mid C) \ge 0$ [13], together with (4), yields

$$\begin{aligned}
H(W \mid C) &\le H(W) \\
\sum_{w} p(W = w)\, I(w) &\le -\sum_{w} p(W = w) \ln p(W = w) \\
&= \sum_{w} p(W = w)\, I_0(w)
\end{aligned}$$

if and only if $I(w) \le I_0(w)$, as we wanted to prove. Combining (1) and (5) results in

$$I_C(\ell) \le I(\ell) \le I_0(\ell), \tag{6}$$

where $I_C(\ell)$ is defined as

$$I_C(\ell) = \sum_{\|w\| = \ell} p(W = w \mid \|w\| = \ell)\, H(C \mid w).$$
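A complementary, word-by-word way to see the upper bound $I(w) \le I_0(w)$ is sketched below; it is offered as an illustration and is not part of the original argument. By Bayes' theorem, $p(W = w \mid C = c) = p(W = w)\, p(C = c \mid W = w) / p(C = c)$, so that

$$\begin{aligned}
I(w) &= -\sum_{c} p(C = c \mid W = w) \left[ \ln p(W = w) + \ln \frac{p(C = c \mid W = w)}{p(C = c)} \right] \\
&= I_0(w) - D\big(p(C \mid W = w) \,\|\, p(C)\big) \le I_0(w),
\end{aligned}$$

where the last step uses the non-negativity of the relative entropy between two distributions over contexts, $p(C \mid W = w)$ and $p(C)$. Equality holds when the distribution of contexts given $w$ equals the marginal one, as happens under independence between $C$ and $W$.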

3. Information content versus length in random typing 

Random typing [6, 10] is a process in which a sequence of characters is produced by
sampling randomly from a set of possible characters. Here we consider a generalized
random typing model based upon variants allowing for unequal letter probabilities, as in
[7, 10], and allowing one to specify a minimum word length [14].

Assume that characters are produced from an alphabet $\Sigma = \{\sigma_0, \ldots, \sigma_i, \ldots, \sigma_{\lambda-1}\}$,
where $\lambda$ is the alphabet size, $\sigma_0$ represents the word delimiter (i.e., the space character)
and the remaining characters of $\Sigma$ are letters. We assume that all the characters in $\Sigma$
are produced at random and independently, with the only exception that two instances
of the space character must be separated by at least $\ell_0$ intervening characters other
than the space. In such a model, the production of a word is separated into two phases:
generation of the space-free prefix of length $\ell_0$, and generation of the remainder. $S$ is
a random variable taking values from $\Sigma$ as generated by the random typing process.
$p_\Sigma(S = s)$ is defined as the probability of producing character $s$ as the $k$-th character
after the last space produced (or after the beginning of the sequence if no space has been
produced yet), for any value $k > \ell_0$. $p_{\Sigma \setminus \{\sigma_0\}}(S = s)$ is the same probability as $p_\Sigma(S = s)$
but for values of $k \le \ell_0$. The abbreviation $p_0 = p_\Sigma(S = \sigma_0)$ will be used hereafter. We
assume that $p_\Sigma(S = s) > 0$ for all characters in $\Sigma$, with the additional constraint that
$p_0 < 1$. $p_{\Sigma \setminus \{\sigma_0\}}(S = s)$ is defined in terms of $p_\Sigma(S = s)$,

$$p_{\Sigma \setminus \{\sigma_0\}}(S = s) = \begin{cases} \dfrac{p_\Sigma(S = s)}{1 - p_0} & \text{if } s \ne \sigma_0, \\ 0 & \text{if } s = \sigma_0. \end{cases}$$

The generalized random typing process with unequal letter probabilities is defined by $\lambda$
parameters: $\ell_0$ and the $\lambda - 1$ probabilities $p_\Sigma(S = \sigma_i)$ for $0 \le i \le \lambda - 2$ with

$$\sum_{i=0}^{\lambda-2} p_\Sigma(S = \sigma_i) < 1.$$

Notice the additional parameter $\ell_0$, which is not considered in other versions of the random
typing model allowing for unequal character probabilities [7, 10].
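The two-phase process described above can be simulated directly. The sketch below is a minimal illustration: the function name `random_typing`, the four-character alphabet and its probabilities are assumptions introduced here, not parameters taken from the paper.

```python
import random

def random_typing(n_words, probs, l0=0, space=" ", seed=0):
    """Generate n_words words from the generalized random typing process.

    probs: dict character -> probability, including the space (p_0).
    l0: minimum word length; the first l0 characters of a word are never a space.
    """
    rng = random.Random(seed)
    chars = list(probs)
    weights = [probs[ch] for ch in chars]
    letters = [ch for ch in chars if ch != space]
    # Inside the space-free prefix only letters are drawn; random.choices
    # renormalises the weights, which matches dividing by (1 - p_0).
    letter_weights = [probs[ch] for ch in letters]
    words, current = [], []
    while len(words) < n_words:
        if len(current) < l0:
            ch = rng.choices(letters, weights=letter_weights)[0]
        else:
            ch = rng.choices(chars, weights=weights)[0]
        if ch == space:
            words.append("".join(current))
            current = []
        else:
            current.append(ch)
    return words

# Hypothetical parameters: three letters plus the space, minimum length 2.
example_probs = {" ": 0.2, "a": 0.5, "b": 0.2, "c": 0.1}
sample = random_typing(10_000, example_probs, l0=2)
mean_length = sum(map(len, sample)) / len(sample)
```

With these illustrative parameters, the number of characters typed after the prefix is geometrically distributed, so the mean word length should come out close to $\ell_0 + (1 - p_0)/p_0 = 6$.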

In the remainder of this section we start by proving that $I_0(\ell)$ is a linear function
of $\ell$, providing exact analytical expressions for its slope and intercept. We continue by
showing that $I(\ell)$ can be inferred from $I_0(\ell)$. If the context is defined by words, as in
Ref. [4], then $I(\ell) = I_0(\ell)$ because our generalized random typing process produces words
independently from the previous ones. If the context consists of characters, then $I(\ell) = I_0(\ell)$
is also warranted when $\ell_0 = 0$ because this is the case where self-repulsion of the space
is suppressed. When $\ell_0 > 0$, (6) indicates that $I(\ell)$ cannot exceed $I_0(\ell)$.

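In order to calculate the probability of producing a concrete word $w =
s_1, \ldots, s_i, \ldots, s_\ell$, where $s_i$ is the $i$-th character of $w$ (taken from $\Sigma$), we use the shorthand

$$P_{i,j} = \prod_{h=i}^{j} p_\Sigma(S = s_h).$$

By the independence between characters (except for space self-repulsion at distances
smaller than $\ell_0$), the probability that a random word $W$ that has length $\ell$ coincides
with $w = s_1, \ldots, s_i, \ldots, s_\ell$ is

$$\begin{aligned}
p(W = w, \|w\| = \ell) &= \left( \prod_{i=1}^{\ell_0} p_{\Sigma \setminus \{\sigma_0\}}(S = s_i) \right) \left( \prod_{i=\ell_0+1}^{\ell} p_\Sigma(S = s_i) \right) p_0 \\
&= \frac{p_0}{(1 - p_0)^{\ell_0}} \prod_{i=1}^{\ell} p_\Sigma(S = s_i),
\end{aligned} \tag{7}$$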

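the probability that a word has length $\ell$ is

$$p(\|w\| = \ell) = p_0 (1 - p_0)^{\ell - \ell_0}$$

and the probability of a word $w$ given its length is therefore

$$p(W = w \mid \|w\| = \ell) = \frac{p(W = w, \|w\| = \ell)}{p(\|w\| = \ell)} = \frac{1}{(1 - p_0)^{\ell}} \prod_{i=1}^{\ell} p_\Sigma(S = s_i). \tag{8}$$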
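As a sanity check on these expressions, the probability of each word length can be recovered by brute-force enumeration of all words of that length. The sketch below assumes a small hypothetical alphabet (three letters plus the space) and $\ell_0 = 2$; the helper name and the parameter values are illustrative only.

```python
import itertools

def joint_word_probability(word, probs, l0, space=" "):
    """p(W = w, |w| = l) for a word of length l >= l0, built character by character."""
    p0 = probs[space]
    p = 1.0
    for position, ch in enumerate(word, start=1):
        if position <= l0:
            p *= probs[ch] / (1.0 - p0)  # space-free prefix: renormalised letters
        else:
            p *= probs[ch]               # unconstrained part of the word
    return p * p0                        # the space that closes the word

probs = {" ": 0.2, "a": 0.5, "b": 0.2, "c": 0.1}  # hypothetical values
l0, p0 = 2, probs[" "]
letters = [ch for ch in probs if ch != " "]

# Sum the joint probability over every word of a given length.
length_probability = {}
for l in range(l0, 7):
    length_probability[l] = sum(
        joint_word_probability("".join(w), probs, l0)
        for w in itertools.product(letters, repeat=l)
    )
```

Each entry of `length_probability` should match $p_0 (1 - p_0)^{\ell - \ell_0}$ up to rounding, which is the geometric law stated above.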

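Applying the joint probability in (7), the self-information of a word $w$ of length $\ell$ is

$$-\ln p(W = w, \|w\| = \ell) = b - \sum_{i=1}^{\ell} \ln p_\Sigma(S = s_i), \tag{9}$$

where $b$ is defined as

$$b = \ln \frac{(1 - p_0)^{\ell_0}}{p_0}. \tag{10}$$

Combining (8) and (9) with the definition of $I_0(\ell)$ in (2) gives

$$I_0(\ell) = \frac{1}{(1 - p_0)^{\ell}} \sum_{s_1, \ldots, s_\ell} \left( \prod_{h=1}^{\ell} p_\Sigma(S = s_h) \right) \left( b - \sum_{i=1}^{\ell} \ln p_\Sigma(S = s_i) \right).$$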

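Bearing in mind that, with $P_{i,j} = \prod_{h=i}^{j} p_\Sigma(S = s_h)$,

$$\begin{aligned}
\sum_{s_1, \ldots, s_\ell} P_{1,\ell}
&= \sum_{s_1 \in \Sigma \setminus \{\sigma_0\}} \cdots \sum_{s_i \in \Sigma \setminus \{\sigma_0\}} \cdots \sum_{s_\ell \in \Sigma \setminus \{\sigma_0\}} P_{1,\ell} \\
&= \sum_{s_1 \in \Sigma \setminus \{\sigma_0\}} \cdots \sum_{s_i \in \Sigma \setminus \{\sigma_0\}} \cdots \sum_{s_\ell \in \Sigma \setminus \{\sigma_0\}} \prod_{h=1}^{\ell} p_\Sigma(S = s_h) \\
&= \sum_{s_1 \in \Sigma \setminus \{\sigma_0\}} \cdots \sum_{s_i \in \Sigma \setminus \{\sigma_0\}} \cdots \sum_{s_{\ell-1} \in \Sigma \setminus \{\sigma_0\}} P_{1,\ell-1} \sum_{s_\ell \in \Sigma \setminus \{\sigma_0\}} p_\Sigma(S = s_\ell) \\
&= (1 - p_0) \sum_{s_1 \in \Sigma \setminus \{\sigma_0\}} \cdots \sum_{s_i \in \Sigma \setminus \{\sigma_0\}} \cdots \sum_{s_{\ell-1} \in \Sigma \setminus \{\sigma_0\}} P_{1,\ell-1} \\
&= (1 - p_0)^2 \sum_{s_1 \in \Sigma \setminus \{\sigma_0\}} \cdots \sum_{s_i \in \Sigma \setminus \{\sigma_0\}} \cdots \sum_{s_{\ell-2} \in \Sigma \setminus \{\sigma_0\}} P_{1,\ell-2} \\
&= \cdots = (1 - p_0)^{\ell},
\end{aligned}$$

one can write

$$I_0(\ell) = b + \frac{1}{(1 - p_0)^{\ell}} \sum_{j=1}^{\ell} \sum_{s_1, \ldots, s_\ell} P_{1,\ell} \left( -\ln p_\Sigma(S = s_j) \right). \tag{11}$$

Notice that

$$\begin{aligned}
\sum_{s_1, \ldots, s_\ell} P_{1,\ell} \left( -\ln p_\Sigma(S = s_j) \right)
&= \sum_{s_1, \ldots, s_{j-1}, s_{j+1}, \ldots, s_\ell} P_{1,j-1} P_{j+1,\ell} \sum_{s_j \in \Sigma \setminus \{\sigma_0\}} -p_\Sigma(S = s_j) \ln p_\Sigma(S = s_j) \\
&= \left( H_\Sigma(S) + p_0 \ln p_0 \right) \sum_{s_1, \ldots, s_{j-1}, s_{j+1}, \ldots, s_\ell} P_{1,j-1} P_{j+1,\ell} \\
&= \left( H_\Sigma(S) + p_0 \ln p_0 \right) (1 - p_0)^{\ell - 1},
\end{aligned} \tag{12}$$

where

$$\begin{aligned}
H_\Sigma(S) &= -\sum_{s \in \Sigma} p_\Sigma(S = s) \ln p_\Sigma(S = s) \\
&= -\sum_{s \in \Sigma \setminus \{\sigma_0\}} p_\Sigma(S = s) \ln p_\Sigma(S = s) - p_0 \ln p_0
\end{aligned} \tag{13}$$

is the character entropy after the space-free prefix of length $\ell_0$. Therefore, applying (12)
to (11) one finally obtains $I_0(\ell) = a\ell + b$, where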

$$a = \frac{1}{1 - p_0} \left( H_\Sigma(S) + p_0 \ln p_0 \right)$$

and $b$ is defined as in (10). Notice that the slope $a$ is always positive because $H_\Sigma(S) \ge 0$
as any entropy and, according to (13), $H_\Sigma(S) > -p_0 \ln p_0$ provided that $\lambda > 1$ (recall
that no character from $\Sigma$ has probability zero of occurring after the space-free prefix).
Therefore, $I_0(\ell)$ grows linearly with $\ell$ for $\lambda > 1$.
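The linear law can be verified exactly for small alphabets by enumerating every word of a given length and averaging its self-information. The following sketch assumes the same kind of hypothetical four-character alphabet as a toy case; the function names and parameter values are introduced here for illustration and do not come from the paper.

```python
import itertools
import math

def expected_self_information(l, probs, l0, space=" "):
    """I_0(l) by brute force: average of -ln p(W=w, |w|=l) over words of length l >= l0."""
    p0 = probs[space]
    letters = [ch for ch in probs if ch != space]
    weighted_sum, p_length = 0.0, 0.0
    for w in itertools.product(letters, repeat=l):
        # joint probability p(W = w, |w| = l) in closed form
        p_joint = p0 / (1.0 - p0) ** l0 * math.prod(probs[ch] for ch in w)
        weighted_sum -= p_joint * math.log(p_joint)
        p_length += p_joint
    return weighted_sum / p_length  # divide by p(|w| = l) to condition on length

def slope_and_intercept(probs, l0, space=" "):
    """Analytical slope a and intercept b of I_0(l) = a*l + b."""
    p0 = probs[space]
    entropy = -sum(p * math.log(p) for p in probs.values())  # character entropy
    a = (entropy + p0 * math.log(p0)) / (1.0 - p0)
    b = math.log((1.0 - p0) ** l0 / p0)
    return a, b

probs = {" ": 0.2, "a": 0.5, "b": 0.2, "c": 0.1}  # hypothetical values
l0 = 2
a, b = slope_and_intercept(probs, l0)
exact = {l: expected_self_information(l, probs, l0) for l in range(l0, 8)}
```

For every length in `exact`, the enumerated value should coincide with `a * l + b` up to floating-point error, even though the letters are not equally likely.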

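Table 1 summarizes the parameters of the linear relationship between $I_0(\ell)$ and $\ell$ for our
generalized random typing process and two particular cases: (a) equal letter probabilities
(all characters except the space must be equally likely) [14] and (b) equal character
probabilities (all characters including the space are equally likely) with empty words
allowed, i.e. $\ell_0 = 0$ [9]. Notice that (b) is a particular case of (a).

Table 1. Summary of the linear dependency between the self-information content as
a function of word length, $I_0(\ell) = a\ell + b$, and related quantities for three major variants
of the random typing process. $H_\Sigma(S)$ is the entropy of characters after the space-free
prefix of length $\ell_0$, $p_0$ is the probability of space and $\lambda$ is the cardinality of $\Sigma$. $p_s$ is
used as a shorthand for $p_\Sigma(S = s)$.

$$\begin{array}{llll}
\hline
\text{Random typing} & \text{Generalized} & \text{Equal letter probabilities [14]} & \text{Equal character probabilities (with } \ell_0 = 0\text{)} \\
\hline
a & \dfrac{1}{1 - p_0}\left(H_\Sigma(S) + p_0 \ln p_0\right) & \ln \dfrac{\lambda - 1}{1 - p_0} & \ln \lambda \\
b & \ln \dfrac{(1 - p_0)^{\ell_0}}{p_0} & \ln \dfrac{(1 - p_0)^{\ell_0}}{p_0} & \ln \lambda \\
H_\Sigma(S) & -\sum_{s \in \Sigma} p_s \ln p_s & (1 - p_0) \ln \dfrac{\lambda - 1}{1 - p_0} - p_0 \ln p_0 & \ln \lambda \\
p_0 & p_0 & p_0 & \dfrac{1}{\lambda} \\
p(W = w, \|w\| = \ell) & \dfrac{p_0}{(1 - p_0)^{\ell_0}} \prod_{i=1}^{\ell} p_{s_i} & \dfrac{(1 - p_0)^{\ell - \ell_0}\, p_0}{(\lambda - 1)^{\ell}} & \dfrac{1}{\lambda^{\ell + 1}} \\
p(W = w \mid \|w\| = \ell) & \dfrac{1}{(1 - p_0)^{\ell}} \prod_{i=1}^{\ell} p_{s_i} & \dfrac{1}{(\lambda - 1)^{\ell}} & \dfrac{1}{(\lambda - 1)^{\ell}} \\
\hline
\end{array}$$

Variant (a) [14] means that

$$p_\Sigma(S = s) = \begin{cases} p_0 & \text{if } s = \sigma_0, \\ \dfrac{1 - p_0}{\lambda - 1} & \text{otherwise,} \end{cases}$$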

and is defined only by three parameters: $\ell_0$, $\lambda$ and $p_0$. The random typing process
defined in [6] is a particular case with $\ell_0 = 0$. In a random typing process with equal
letter probabilities, the character entropy after the space-free prefix is

$$\begin{aligned}
H_\Sigma(S) &= (\lambda - 1) \left( -\frac{1 - p_0}{\lambda - 1} \ln \frac{1 - p_0}{\lambda - 1} \right) - p_0 \ln p_0 \\
&= (1 - p_0) \ln \frac{\lambda - 1}{1 - p_0} - p_0 \ln p_0.
\end{aligned}$$

Variant (b), the simplest random typing process that has been presented to our knowledge,
is defined with only one parameter, i.e. $\lambda$ ($\ell_0 = 0$ and $p_0 = 1/\lambda$ in that case). Variant (b)
is known as the fair die rolling experiment [9] (see [7] for a version with $\ell_0 = 1$ and
$p_0 = 1/\lambda$).
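To make the special cases concrete, the sketch below evaluates the slope and intercept for equal letter probabilities. Substituting the character entropy above into the general slope gives $a = \ln[(\lambda - 1)/(1 - p_0)]$; the function name and the choice $\lambda = 27$ (26 letters plus the space) are assumptions made here for illustration.

```python
import math

def equal_letter_parameters(alphabet_size, p0, l0):
    """Slope a and intercept b of I_0(l) = a*l + b when all letters are equally likely."""
    a = math.log((alphabet_size - 1) / (1.0 - p0))
    b = math.log((1.0 - p0) ** l0 / p0)
    return a, b

lam = 27  # hypothetical alphabet: 26 letters plus the space
# Variant (b): every character, including the space, has probability 1/lam and l0 = 0.
a, b = equal_letter_parameters(lam, p0=1.0 / lam, l0=0)
```

In variant (b) both parameters reduce to $\ln \lambda$, so that $I_0(\ell) = (\ell + 1) \ln \lambda$: each of the $\ell$ letters and the closing space contributes $\ln \lambda$ nats.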

4. Conclusion 

We have shown that $I(\ell) = a\ell + b$ does not imply that speakers have made optimal
choices as argued in [4]. Uniform information density or related hypotheses (e.g., [5])
are not at all necessary to account for the linear correlation between $I(\ell)$ and $\ell$: typing
at random yields the same dependency independently from context. Our main point
is that a linear correlation between information content and word length may simply
arise internally, from the units making a word (e.g., letters) and not necessarily from
the interplay between words and their context as suggested in [4]. However, future
research should investigate if the parameters of the linear relationship predicted by
random typing coincide with those of real texts or if a linear relationship is sufficient to
account for the actual dependency between $I(\ell)$ and $\ell$ in real languages, as is suggested
by the long-range correlations in texts at the level of words [15] or letters [16, 17] and
the differences between random typing and real language at the level of the distribution
of word frequencies (HI [18] or word lengths [19].